This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Add optimized UTF-8 validation and transcoding apis, hook them up to UTF8Encoding #21948

Merged

Member

@GrabYourPitchforks GrabYourPitchforks commented Jan 11, 2019

This is the first batch of the improved UTF-8 validation and transcoding APIs. There's a single workhorse method used for validation and for counting the number of code units that would result from transcoding.

Philosophically, these methods differ from existing methods on System.Text.Encoding. The existing Encoding methods really push consumers to transcode all data to UTF-16 so that they can take advantage of existing UTF-16 data manipulation and inspection APIs. Instead, these new APIs expose information about the underlying structure of the UTF-8 data itself. The intent is that error handling ("I found an invalid subsequence in the stream; what do I do now?") is pushed to a higher layer. That higher layer can be built on top of this API and other inspection APIs in order to provide the desired error handling semantics. This will be clearer when the OperationStatus-based APIs come online in a future PR.
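
The idea of reporting where the data is malformed, and leaving the policy decision to the caller, can be sketched with the OperationStatus-based `Utf8.ToUtf16` method that later shipped in `System.Text.Unicode`. This is a minimal sketch under stated assumptions, not code from this PR: `Utf8Inspection` and `IndexOfFirstInvalidByte` are hypothetical names, and the 256-char scratch buffer size is arbitrary.

```csharp
using System;
using System.Buffers;
using System.Text.Unicode;

public static class Utf8Inspection
{
    // Hypothetical helper: returns the index of the first byte that starts an
    // invalid UTF-8 subsequence, or -1 if the whole input is well-formed.
    public static int IndexOfFirstInvalidByte(ReadOnlySpan<byte> utf8)
    {
        Span<char> scratch = stackalloc char[256]; // transcoded output is discarded
        int offset = 0;

        while (offset < utf8.Length)
        {
            OperationStatus status = Utf8.ToUtf16(
                utf8.Slice(offset), scratch,
                out int bytesRead, out _,
                replaceInvalidSequences: false, // report errors instead of fixing them up
                isFinalBlock: true);            // a truncated trailing sequence is an error

            offset += bytesRead;

            if (status == OperationStatus.InvalidData)
            {
                return offset; // the caller decides what to do with the bad data
            }
            // DestinationTooSmall: the scratch buffer filled up, keep scanning.
        }

        return -1;
    }

    public static void Main()
    {
        byte[] wellFormed = { 0x48, 0x69, 0xC3, 0xA9 }; // "Hi" followed by U+00E9
        byte[] malformed = { 0x48, 0x69, 0xFF, 0x21 };  // 0xFF never appears in UTF-8

        Console.WriteLine(IndexOfFirstInvalidByte(wellFormed));
        Console.WriteLine(IndexOfFirstInvalidByte(malformed));
    }
}
```

A higher layer could use the returned index to throw, substitute U+FFFD, or skip the bad bytes, which is the split of responsibilities described above.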

Benchmarks are pending.

Unit tests at dotnet/corefx#34538.

Resolves https://github.com/dotnet/corefx/issues/36163.

@GrabYourPitchforks
Member Author

One thing I struggled with here is how to balance readability / safety of code vs. performance. Since validation and transcoding are going to be such common operations, I weighted the decision heavily on the side of highest possible performance. This should be borne out by the benchmarks when I rerun them.

Many of the more obtuse lines of code exist because they cause the JIT to emit very specific and optimized assembly. I'm hoping that a combination of code comments and test coverage makes the code more maintainable.

@GrabYourPitchforks
Member Author

/cc @ahsonkhan and @bartonjs who may find this interesting because it'll be followed by the transcoding and escaping APIs you've been asking for. :)

@mikedn

mikedn commented Jan 15, 2019

> Many of the more obtuse lines of code exist because they cause the JIT to emit very specific and optimized assembly. I'm hoping that a combination of code comments and test coverage makes the code more maintainable.

It would be nice to have a list of these obtuse lines in this PR. Corelib is rather quickly becoming a collection of such hacks, and removing them when the JIT improves might not be so easy.

- Hook it up through the existing Utf8 public static APIs
- Move some shared methods out of ASCIIUtility
- Hook it up through the Utf8String ctor
@GrabYourPitchforks GrabYourPitchforks changed the title [WIP] UTF-8 validation apis [WIP] Optimized UTF-8 validation and transcoding apis Mar 27, 2019
@GrabYourPitchforks
Member Author

This is still WIP.

Todo:

- Add vectorized UTF-16 validation and transcoded byte counts
- Move Utf16Utility into Unicode namespace alongside Utf8Utility
- Fix some bugs in DecoderNLS's draining logic
@GrabYourPitchforks GrabYourPitchforks changed the title [WIP] Optimized UTF-8 validation and transcoding apis Add optimized UTF-8 validation and transcoding apis, hook them up to UTF8Encoding Apr 3, 2019
@GrabYourPitchforks
Member Author

Removed WIP marker. This is now code-complete. There are some tests failing in corefx since they (incorrectly) assume things about the output of transcoding invalid data. I'll send a separate PR through corefx with those fixes. In the meantime, once I build the full list of failing tests that need to be changed I'll also update the global suppression file in this repo.

Benchmarks are pending.

High-level summary of changes:

  • Create vectorized transcoding logic and char counting / byte counting logic.
  • Hook up the fast transcoding logic through the System.Text.Unicode.Utf8 class.
  • Plumb this logic through System.Text.UTF8Encoding.
  • Move internal Utf16Utility type to System.Text.Unicode namespace to live next to Utf8Utility.
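
For readers unfamiliar with what "byte counting" computes, here is a minimal scalar sketch that counts the UTF-8 bytes needed for a UTF-16 input. It is not the vectorized implementation from this PR. `Utf8ByteCounter` and `CountUtf8Bytes` are hypothetical names, and treating a lone surrogate as a 3-byte U+FFFD is an assumption chosen to match the default replacement behavior of `Encoding.UTF8`.

```csharp
using System;
using System.Text;

public static class Utf8ByteCounter
{
    // Hypothetical scalar helper: number of UTF-8 bytes needed to encode the input.
    public static int CountUtf8Bytes(ReadOnlySpan<char> utf16)
    {
        int count = 0;

        for (int i = 0; i < utf16.Length; i++)
        {
            char c = utf16[i];

            if (c < 0x80)
            {
                count += 1; // ASCII
            }
            else if (c < 0x800)
            {
                count += 2; // U+0080..U+07FF
            }
            else if (char.IsHighSurrogate(c)
                && i + 1 < utf16.Length
                && char.IsLowSurrogate(utf16[i + 1]))
            {
                count += 4; // well-formed surrogate pair, one supplementary code point
                i++;        // skip the low surrogate
            }
            else
            {
                count += 3; // rest of the BMP, or a lone surrogate counted as U+FFFD
            }
        }

        return count;
    }

    public static void Main()
    {
        string sample = "h\u00E9llo \u20AC \uD83D\uDE00";

        // Both lines should print the same number.
        Console.WriteLine(CountUtf8Bytes(sample));
        Console.WriteLine(Encoding.UTF8.GetByteCount(sample));
    }
}
```

The per-char branching in this loop is the kind of work a vectorized version can skip for long ASCII runs by inspecting many chars per iteration.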

@GrabYourPitchforks
Member Author

@mikedn I've tried to annotate the "hacks" where they appear throughout the code, including with links back to GitHub issues.

@GrabYourPitchforks
Member Author

Quick benchmarks, using various corpus texts from Project Gutenberg.

| Method | Toolchain | Corpus | Mean | Error | StdDev | Ratio | RatioSD |
|---|---|---|---|---|---|---|---|
| GetByteCount | 3.0-master | 11-0.txt | 7,087.7 us | 137.611 us | 121.989 us | 1.00 | 0.00 |
| GetByteCount | utf8_1 | 11-0.txt | 2,527.0 us | 23.333 us | 20.684 us | 0.36 | 0.01 |
| GetBytes | 3.0-master | 11-0.txt | 15,774.5 us | 314.180 us | 322.640 us | 1.00 | 0.00 |
| GetBytes | utf8_1 | 11-0.txt | 10,600.3 us | 59.802 us | 53.013 us | 0.67 | 0.01 |
| GetCharCount | 3.0-master | 11-0.txt | 6,257.4 us | 46.583 us | 43.574 us | 1.00 | 0.00 |
| GetCharCount | utf8_1 | 11-0.txt | 5,116.3 us | 30.386 us | 26.936 us | 0.82 | 0.01 |
| GetChars | 3.0-master | 11-0.txt | 16,091.1 us | 61.085 us | 54.150 us | 1.00 | 0.00 |
| GetChars | utf8_1 | 11-0.txt | 12,679.0 us | 120.497 us | 106.817 us | 0.79 | 0.01 |
| GetByteCount | 3.0-master | 11.txt | 2,192.5 us | 17.319 us | 13.522 us | 1.00 | 0.00 |
| GetByteCount | utf8_1 | 11.txt | 955.8 us | 7.903 us | 6.599 us | 0.44 | 0.00 |
| GetBytes | 3.0-master | 11.txt | 7,759.5 us | 59.931 us | 53.128 us | 1.00 | 0.00 |
| GetBytes | utf8_1 | 11.txt | 2,303.0 us | 19.080 us | 17.847 us | 0.30 | 0.00 |
| GetCharCount | 3.0-master | 11.txt | 1,093.5 us | 8.459 us | 7.063 us | 1.00 | 0.00 |
| GetCharCount | utf8_1 | 11.txt | 325.3 us | 2.738 us | 2.561 us | 0.30 | 0.00 |
| GetChars | 3.0-master | 11.txt | 6,521.0 us | 128.918 us | 176.465 us | 1.00 | 0.00 |
| GetChars | utf8_1 | 11.txt | 1,549.4 us | 11.943 us | 10.587 us | 0.24 | 0.01 |
| GetByteCount | 3.0-master | 25249-0.txt | 9,129.8 us | 23.333 us | 19.484 us | 1.00 | 0.00 |
| GetByteCount | utf8_1 | 25249-0.txt | 1,166.1 us | 6.068 us | 5.067 us | 0.13 | 0.00 |
| GetBytes | 3.0-master | 25249-0.txt | 21,678.0 us | 81.633 us | 76.360 us | 1.00 | 0.00 |
| GetBytes | utf8_1 | 25249-0.txt | 10,845.7 us | 54.393 us | 48.218 us | 0.50 | 0.00 |
| GetCharCount | 3.0-master | 25249-0.txt | 13,441.7 us | 84.330 us | 74.756 us | 1.00 | 0.00 |
| GetCharCount | utf8_1 | 25249-0.txt | 7,063.4 us | 71.013 us | 66.426 us | 0.53 | 0.01 |
| GetChars | 3.0-master | 25249-0.txt | 29,216.9 us | 122.028 us | 101.899 us | 1.00 | 0.00 |
| GetChars | utf8_1 | 25249-0.txt | 16,593.1 us | 115.657 us | 96.579 us | 0.57 | 0.00 |
| GetByteCount | 3.0-master | 30774-0.txt | 6,624.6 us | 40.983 us | 36.330 us | 1.00 | 0.00 |
| GetByteCount | utf8_1 | 30774-0.txt | 1,011.6 us | 18.997 us | 17.770 us | 0.15 | 0.00 |
| GetBytes | 3.0-master | 30774-0.txt | 19,557.6 us | 64.271 us | 56.975 us | 1.00 | 0.00 |
| GetBytes | utf8_1 | 30774-0.txt | 10,282.8 us | 104.523 us | 92.657 us | 0.53 | 0.01 |
| GetCharCount | 3.0-master | 30774-0.txt | 12,847.1 us | 181.605 us | 169.873 us | 1.00 | 0.00 |
| GetCharCount | utf8_1 | 30774-0.txt | 7,234.3 us | 23.763 us | 18.553 us | 0.57 | 0.01 |
| GetChars | 3.0-master | 30774-0.txt | 26,399.1 us | 119.822 us | 100.056 us | 1.00 | 0.00 |
| GetChars | utf8_1 | 30774-0.txt | 17,474.5 us | 434.245 us | 564.641 us | 0.67 | 0.03 |
| GetByteCount | 3.0-master | 39251-0.txt | 12,026.9 us | 112.657 us | 99.867 us | 1.00 | 0.00 |
| GetByteCount | utf8_1 | 39251-0.txt | 1,295.9 us | 11.245 us | 10.519 us | 0.11 | 0.00 |
| GetBytes | 3.0-master | 39251-0.txt | 31,157.5 us | 273.212 us | 255.563 us | 1.00 | 0.00 |
| GetBytes | utf8_1 | 39251-0.txt | 16,027.7 us | 90.350 us | 75.446 us | 0.51 | 0.01 |
| GetCharCount | 3.0-master | 39251-0.txt | 20,423.3 us | 246.767 us | 230.826 us | 1.00 | 0.00 |
| GetCharCount | utf8_1 | 39251-0.txt | 13,669.7 us | 168.254 us | 157.385 us | 0.67 | 0.01 |
| GetChars | 3.0-master | 39251-0.txt | 40,840.9 us | 399.104 us | 373.322 us | 1.00 | 0.00 |
| GetChars | utf8_1 | 39251-0.txt | 28,058.7 us | 228.340 us | 213.590 us | 0.69 | 0.01 |

These texts are large (~100KB). I'll work on getting benchmarks for smaller texts as well.

@GrabYourPitchforks
Member Author

/azp run coreclr-ci (Build Linux_musl x64 checked)

@azure-pipelines

No pipelines are associated with this pull request.

throw new ArgumentNullException("s");
// Validate input parameters

if (chars is null)
Member

I'm seeing identical codegen for == null and is null on string. Why the change?

Member

@gfoidl I'm seeing that both patterns just reduce to brtrue or brfalse from the compiler (for both string and array arguments, whether the method takes one argument or several). So while it may be true for types other than string that overload operator==, I'm challenging @tannergooding's assertion that it matters.

Once it doesn't matter, we're back to what I once saw on a poster: "change is terrible, unless it's awesome". In this case, if there are two ways of identically doing the same thing, the one with fewer lines of diff is better.

Member Author

This is copied + pasted from the ASCIIEncoding class, which is acting as the source of truth for these implementations and will eventually roll out to UnicodeEncoding, UTF32Encoding, etc. It's more lines in this particular diff, but it's overall less diff if you compare ASCIIEncoding.cs and UTF8Encoding.cs against each other.
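
The case where `== null` and `is null` can genuinely differ is a type with its own `operator==`. The sketch below is illustrative only: `AlwaysEqual` and `NullCheckDemo` are hypothetical types invented for this example, and it makes no claim about the IL emitted for `string`, which is the reviewers' observation in this thread.

```csharp
using System;

// Hypothetical type with a deliberately misleading equality operator.
public sealed class AlwaysEqual
{
    public static bool operator ==(AlwaysEqual left, AlwaysEqual right) => true;
    public static bool operator !=(AlwaysEqual left, AlwaysEqual right) => false;

    public override bool Equals(object obj) => true;
    public override int GetHashCode() => 0;
}

public static class NullCheckDemo
{
    // Calls the user-defined operator==, so the answer depends on that overload.
    public static bool EqualsNull(AlwaysEqual value) => value == null;

    // Pattern match against null: a plain reference check that ignores operator==.
    public static bool IsNull(AlwaysEqual value) => value is null;

    public static void Main()
    {
        var instance = new AlwaysEqual();

        Console.WriteLine(EqualsNull(instance)); // True, because the overload says so
        Console.WriteLine(IsNull(instance));     // False, the reference is not null
    }
}
```

For `string` and array parameters, as in the code under review, both spellings give the same answer, which is why the thread treats the choice as a question of diff size rather than behavior.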

return GetChars(bytesPtr, bytes.Length, charsPtr, chars.Length, baseDecoder: null);
ThrowHelper.ThrowArgumentOutOfRangeException(
argument: (byteCount < 0) ? ExceptionArgument.byteCount : ExceptionArgument.charCount,
resource: ExceptionResource.ArgumentOutOfRange_NeedNonNegNum);
Member

Feels like you do this enough that you'd have a dedicated demuxer helper.

ThrowHelper.ThrowNeedNonNegNum(
    byteCount,
    ExceptionArgument.byteCount,
    ExceptionArgument.charCount);

...

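// Both names are ExceptionArgument values, matching the call site above.
static void ThrowNeedNonNegNum(int firstVal, ExceptionArgument firstName, ExceptionArgument secondName)
{
    ThrowArgumentOutOfRangeException(
        firstVal < 0 ? firstName : secondName,
        ExceptionResource.ArgumentOutOfRange_NeedNonNegNum);
}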

Just to cut down on the size of the boilerplate.

Member Author

@GrabYourPitchforks GrabYourPitchforks Apr 11, 2019

Agree. Amusingly enough, it's written this way because it minimized the diff in the ASCIIEncoding.cs case, and the code got copied to this file so that the diff between ASCIIEncoding.cs and UTF8Encoding.cs would be minimized.

There are some other changes / optimizations I want to make here that should further reduce the overhead of these method calls. Will open a new PR so that I can address them in ASCIIEncoding.cs and here at the same time.

if (baseEncoder != null)
if (((decoder is null) ? this.DecoderFallback : decoder.Fallback) is DecoderReplacementFallback replacementFallback
&& replacementFallback.MaxCharCount == 1
&& replacementFallback.DefaultString[0] == UnicodeUtility.ReplacementChar)
Member

I'm guessing tests are in corefx; but ensure that there's a custom replacement fallback of "(ReplacementChar)!" (StartsWith, not a count of 1) and a custom replacement to "." (right length, wrong char). One to "badger" or some other completely-not-the-thing-we're-looking-for is probably also fine.
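
The fallback shapes named in this comment can be exercised through public APIs alone. This is a hedged sketch rather than the actual corefx tests: `FallbackDemo` and `DecodeWithReplacement` are hypothetical names, and the input is just one invalid byte between two ASCII letters.

```csharp
using System;
using System.Text;

public static class FallbackDemo
{
    // Decodes UTF-8 bytes, substituting `replacement` for each invalid subsequence.
    public static string DecodeWithReplacement(byte[] utf8, string replacement)
    {
        Encoding encoding = Encoding.GetEncoding(
            "utf-8",
            EncoderFallback.ReplacementFallback,
            new DecoderReplacementFallback(replacement));

        return encoding.GetString(utf8);
    }

    public static void Main()
    {
        byte[] input = { 0x41, 0xFF, 0x42 }; // 'A', an invalid byte, 'B'

        string[] replacements =
        {
            "\uFFFD",   // the default shape that the fast path looks for
            "\uFFFD!",  // starts with U+FFFD but is longer than one char
            ".",        // right length, wrong char
            "badger",   // nothing like the default
        };

        foreach (string replacement in replacements)
        {
            Console.WriteLine(DecodeWithReplacement(input, replacement));
        }
    }
}
```

Only the first replacement string satisfies the three-part condition in the code under review. The other three are expected to take the general fallback path and still produce their custom text.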

@GrabYourPitchforks
Member Author

Thank you all @gfoidl @ahsonkhan @bartonjs @mikedn @tannergooding for the thoughtful feedback you've all given throughout this process. I know it was a slog. Your assistance is much appreciated! 🎉

@GrabYourPitchforks
Member Author

CI failures are due to #23902, which has already been fixed. Going ahead with this as-is.

@GrabYourPitchforks GrabYourPitchforks merged commit 77a09eb into dotnet:master Apr 12, 2019
@GrabYourPitchforks GrabYourPitchforks deleted the utf8_validation_apis branch April 12, 2019 15:12
@MichalStrehovsky
Member

@GrabYourPitchforks could you squash and merge next time? "Clarify comment in 3-byte processing", "Add missing check to 3-byte processing logic", "PR feedback: Fix typos" and similar clutter the git history.

@GrabYourPitchforks
Member Author

@MichalStrehovsky I did, including writing an actual real description for the commit, but apparently if GitHub encounters a server-side error during the process it automatically falls back to a normal merge commit. 😒

picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
…UTF8Encoding (dotnet/coreclr#21948)

* Add optimized UTF-8 validation and transcoding logic
- Hook it up through the existing Utf8 public static APIs
- Move some shared methods out of ASCIIUtility
- Hook it up through the Utf8String ctor

* Hook up new UTF-8 logic through UTF8Encoding
- Add vectorized UTF-16 validation and transcoded byte counts
- Move Utf16Utility into Unicode namespace alongside Utf8Utility
- Fix some bugs in DecoderNLS's draining logic

* Improve perf of "is ASCII?" inner loop in UTF-8 validation.

* Remove SSE41.X64 optimization from AsciiUtility
RyuJIT now handles this optimally

* Clarify that vector read is unaligned

* Simplify vectorized logic; remove unnecessary adjustment

* PR feedback: GetElement(0) -> Sse2.StoreLow

* PR feedback
- Simplify CountNumberOfLeadingAsciiBytesFrom24BitInteger
- Extract some consts out to top of file w/ comments

* PR feedback: Enable SSE2 in Utf16Utility code

* Expand masks in Utf8Utility, fix const in fallback path

* Temporarily disable failing CoreFX tests

* Fix incorrect Debug.Assert statements

* Add comments tracking JIT workarounds.

* Rename DWORD -> UInt32 throughout API surface

* Re-flow Utf8Utility.Helpers

* PR feedback: Fix typos

* PR feedback: CountNumberOfLeadingAsciiBytesFrom24BitInteger

* PR feedback: Remove redundant endianess checks

* PR feedback: Validate nint definitions

* PR feedback: Clarify charIsNonAscii vector usage

* PR feedback: document tempUtf8CodeUnitCountAdjustment usage

* Fix compilation failure in Utf16Utility

* PR feedback: Clarify 3-byte sequence processing

* Add missing check to 3-byte processing logic

* Clarify comment in 3-byte processing


Commit migrated from dotnet/coreclr@77a09eb
Development

Successfully merging this pull request may close these issues.

UTF8Encoding drops bytes during decoding some input sequences
8 participants